Removing line break to make text *flow* Thread poster: Mette Hansen
| Mette Hansen Denmark Local time: 12:23 Member (2002) English to Danish + ...
I just saved a PDF file to RTF in order to make a TM in WinAlign, but the problem is the line breaks. The lines break in the middle of a sentence instead of at the end of the sentence, at the full stop.
Does anybody know how to remove line breaks and make the text flow?
Thanks in advance. | | | Ralf Lemster Germany Local time: 12:23 English to German + ... Search and Replace | Nov 21, 2003 |
You can use Search and Replace, searching for a paragraph break and replacing it with nothing (or a space, depending on your text structure).
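For anyone working on a plain-text export rather than in Word, the same idea can be sketched in Python. This is only an illustration: the function name `join_lines` is made up here, and a newline stands in for Word's paragraph mark.

```python
def join_lines(text: str, joiner: str = " ") -> str:
    """Replace every newline with `joiner` (plain-text analogue of replacing ^p)."""
    return text.replace("\n", joiner)


sample = "The lines break in the\nmiddle of a sentence."
print(join_lines(sample, ""))   # the words on either side of the break fuse
print(join_lines(sample, " "))  # the words stay separated
```

Whether an empty string or a space is the right replacement depends on whether the lines already end in a space.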
BTW saving PDFs as .doc files has improved significantly with Acrobat 6.
HTH, Ralf | | | just do this | Nov 21, 2003 |
You may have a Word program with an interface in a different language but these commands don't change. Open the search/replace window and type ^p in the search line and leave the replace section empty. This way you can remove all the paragraph breaks. | | | sylver Local time: 19:23 English to French A little pointer | Nov 22, 2003 |
There is no way to do that perfectly in one shot, but there are ways to minimize the damage. Try the import function of Wordfast. It tries to guess where the paragraph marks should be left and where they should be removed. (That can be done very easily with the free demo version.)
Check the manual for instructions and settings. If your PDF is long, that little trick could save you hours.
[Edited at 2003-11-22 14:18] | |
|
|
sylver Local time: 19:23 English to French
Line breaks and paragraph marks are not the same. What you have are paragraph marks (^P or ^013 for search purposes), whereas a line break is ^l. | | | Ari Nuncio United States Local time: 05:23 Spanish to English + ... Another approach | Nov 22, 2003 |
I agree with Sylver: this is not going to be a one-shot operation, especially if you don't have WordRight or the demo (which now limits you to files of 110 K or less). But there's a slightly more sophisticated approach that could save you from eliminating all paragraph marks indiscriminately (some of which correspond to real paragraphs) or (on large documents) removing hundreds or even thousands of paragraphs one by one.
In Word, open the Search and Replace module. Type " ^p" in the find box and "^p" in the replace box. Notice that there's a space before the "^p" in the find box. What you're doing here is removing the space in front of every paragraph mark that has one. This may not be necessary for PDF imports, but let's assume you want a technique that will work for any document with unnecessary paragraph marks (I still get them on a regular basis). Hit the "Replace All" button. Now do it again, to make sure you have no extra spaces at the end of lines.
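Outside Word, the space-stripping step above could look roughly like this in Python. It is a sketch under assumptions: the text is a plain-text export, newlines stand in for paragraph marks, and `strip_trailing_spaces` is a name invented for the example.

```python
import re


def strip_trailing_spaces(text: str) -> str:
    """Remove spaces and tabs that sit directly before a newline."""
    # "+" consumes a whole run of spaces at once, so one pass is enough.
    return re.sub(r"[ \t]+(?=\n)", "", text)


cleaned = strip_trailing_spaces("first line  \nsecond line \nthird line\n")
print(repr(cleaned))
```

Because the pattern matches a whole run of spaces, there is no need to repeat the replacement as in the Word version.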
Before I describe the rest of the process, allow me to point out that there's a way to avoid that extra step (indeed, there's a way to make this whole process fully automated) using Visual Basic for Applications. If you're using Word 2000 or XP, I'd be happy to provide you with a template that you can use for just this purpose.
Next step. In Search and Replace, under Search Options, click the box that says Wildcards. In the Find box, type "[a-z]^13[a-z]" (without a period after it). Leave the Replace box blank, but use the Format button to select Highlight, then hit the Replace All button. What you're doing here is highlighting all paragraph marks that have been placed between two lower-case letters.
Under Search Options, deselect Wildcards. Now type "^p" in the Find box. Hit the Format button and select Highlight. The Find box should now have "^p" after "Find what" and it should say Format: Highlight under the box. Go to the Replace box and type in a blank space. Hit the Replace All button.
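The lower-case step can be approximated on a plain-text export with a single regular expression. This is a sketch, not the Word procedure itself: it assumes newlines stand in for paragraph marks, the function name `join_lowercase_breaks` is hypothetical, and `[a-z]` covers only unaccented ASCII letters, so breaks next to letters such as "æ" or "é" are left alone.

```python
import re


def join_lowercase_breaks(text: str) -> str:
    """Replace a newline with a space when it sits between two lower-case letters."""
    # The lookbehind and lookahead check the neighbouring letters without
    # consuming them, so only the newline itself is replaced.
    return re.sub(r"(?<=[a-z])\n(?=[a-z])", " ", text)


sample = "the quick\nbrown fox.\nNext sentence"
print(join_lowercase_breaks(sample))
```

The break after the period is kept because the character before it is not a lower-case letter.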
That takes care of all the paragraph marks between words in lower case. But if the document you're trying to clean has lots of letters in caps that are not the beginning of sentences (or, say, you're working on a document in German, which capitalizes all nouns), you'll need to perform another operation, this time involving a certain amount of risk.
The idea is to remove any paragraph mark that does not occur after the end of a sentence (i.e., ending in a period). The problem is that you may also remove paragraph marks after titles, which in many languages do not end in periods. So your decision to proceed with the next step depends on how prevalent capitalization is in the body of your text. If it's the exception and not the rule, then now would be a good time to start a semi-manual search (as described by others on this page) to remove remaining rogue paragraph marks.
Assuming heavy capitalization is the rule and not the exception in your text, type "[a-z]^13[A-Z]" in the Find box. As above, leave the replace box blank (making absolutely sure that there is no blank space " " in it), and use the Format button to select Highlight. Hit the Replace All button.
Now every paragraph mark that follows a lower-case letter and precedes an upper-case letter has been highlighted. This could include the marks after titles. You'll need to go through your text and place a unique non-text character (not A-Z) at the end of each title line. Use a character that normally is not used in the language your text is in or one that you're sure does not appear anywhere else in the text. For the sake of this exercise, let's say that the character is "¿" (an upside-down question mark).
Now you're ready to replace all highlighted paragraphs again. So, as above: under Search Options, deselect Wildcards. Type "^p" in the Find box. Hit the Format button and select Highlight. The Find box should now have "^p" after "Find what" and it should say Format: Highlight under the box. Go to the Replace box and type in a blank space. Hit the Replace All button.
The last step is easy. We need to remove the upside-down question mark after titles. I suggested that you use a unique character that you're sure does not appear elsewhere in the text, but what if the text is huge and unpredictable, and you can't be sure of anything? Let's hedge our bets by typing the following into the Find box: "¿^p". In the Replace box, type "^p". Replace All.
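For a plain-text export, the capitalization step and the title marker could be combined along these lines. Again a sketch under assumptions: newlines stand in for paragraph marks, `join_before_capitals` and `MARKER` are names invented for the example, the title lines are assumed to have been marked by hand already, and the ASCII-only ranges ignore accented letters.

```python
import re

MARKER = "¿"  # title marker from the post; any character absent from the text would do


def join_before_capitals(text: str, marker: str = MARKER) -> str:
    """Join lower-case/upper-case line breaks, then strip the title markers."""
    # A marked title line ends in the marker, not in a lower-case letter,
    # so the break after it is left alone.
    joined = re.sub(r"(?<=[a-z])\n(?=[A-Z])", " ", text)
    # Remove the marker only where it directly precedes a break, so the same
    # character elsewhere in the text is untouched.
    return joined.replace(marker + "\n", "\n")


sample = "Installation Guide¿\nDie Anlage wird\nMontag geliefert.\nEnde"
print(join_before_capitals(sample))
```

As in the Word version, a title that was not marked will be merged with the line that follows it.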
Any remaining issues will have to be corrected semi-manually. Admittedly, this process is not without drawbacks. But if the document in question is large, you will save yourself a lot of time.
If you'd like the template I mentioned, contact me at aanuncio@prodigy.net.mx. No charge, of course. | | | Mette Hansen Denmark Local time: 12:23 Member (2002) English to Danish + ... TOPIC STARTER Thank you so much!!! | Nov 23, 2003 |
Dear Ralf,
I used your method and it worked perfectly. You have just saved me countless hours of work for this project and many future ones.
I thank you with all my heart.
Sincerely,
Mette
Ralf Lemster wrote:
You can use Search and Replace, searching for a paragraph break and replacing it with nothing (or a space, depending on your text structure).
BTW saving PDFs as .doc files has improved significantly with Acrobat 6.
HTH, Ralf | | |
Ari Nuncio wrote:
...
Next step. In Search and Replace, under Search Options, click the box that says Wildcards. In the Find box, type "[a-z]^13[a-z]." Leave the Replace box blank, but use the Format button to select Highlight. What you're doing here is highlighting all paragraph marks that have been placed between two lower-case letters.
...
Some of what you wrote in your post didn't work for me, some was overzealous, but most worked. After fiddling with it, over 500 rogue paragraph tags were removed from an 8,000-word document. This will save me hours of work. Thanks! | |
|
|
Remove single line breaks | May 20, 2011 |
If the file is no longer than about 10 pages, it's best to just use find and replace manually. Using the keyboard shortcuts in the find and replace window, you can do a page in under 30 seconds.
If the file is too long for this and you need an automated solution, the simplest option is to just remove single line breaks. It's a lot simpler than some of the ideas proposed above, and likely to work just as well.
When you export a PDF, paragraphs and other units are usually separated by two line breaks, while the rogue line breaks are singles. Aligners segment the text anyway, so merging a couple of segments by accident isn't going to cause a huge problem; it's better to be a bit overzealous with merging than to miss split segments by being too cautious. If all goes well, the segmenter will just split any accidentally merged segments again.
If you want to do this in Word, you could replace ^p^p with, say, XXX@@@linebreak@@@XXX, then replace ^p with a space (not with nothing as suggested above!). Then replace XXX@@@linebreak@@@XXX with ^p. You can also replace multiple spaces with single spaces in case the previous step introduced any superfluous spaces.
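The same three replacements can be tried on the txt export of a PDF with a few lines of Python. This is a rough sketch: it assumes real paragraphs are separated by exactly two newlines, and the function name `remove_single_breaks` is made up for the example.

```python
import re

PLACEHOLDER = "XXX@@@linebreak@@@XXX"  # same throwaway token as in the Word steps


def remove_single_breaks(text: str) -> str:
    """Keep double line breaks as paragraph breaks; turn single ones into spaces."""
    text = text.replace("\n\n", PLACEHOLDER)  # protect the real paragraph breaks
    text = text.replace("\n", " ")            # rogue single breaks become spaces
    text = text.replace(PLACEHOLDER, "\n")    # restore one break per paragraph
    return re.sub(r" {2,}", " ", text)        # collapse any doubled spaces


sample = "First line\nof paragraph one.\n\nSecond paragraph\ncontinues here."
print(remove_single_breaks(sample))
```

The placeholder only works if it does not occur in the text itself, which is why it is deliberately ugly.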
BTW LF Aligner does all this automatically if you feed it a pdf file (or the txt export of a pdf file).
[Edited at 2011-05-20 12:10 GMT]