# C++ get individual words from a string



## rabidgnome229

If you're using cstrings you can use strtok. I'm not sure about a C++ equivalent, but the concept you're looking for is tokenizing strings. Punching "C++ tokenize string" into Google should get you the answer you need.


----------



## Plex

Tokenizing is splitting a string into pieces (tokens) using delimiters.

Can you give me an example of what you're trying to accomplish? I'm not sure whether you're trying to tokenize or not; your OP wasn't very clear.


----------



## superhead91

Stringstreams should work. Use an istringstream and a string. Example:

Code:


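#include <iostream>
#include <sstream>
#include <string>
using namespace std;

int main(){

  istringstream ss("This is the lyrics of a song"); // stream over the text to split
  string s;

  // operator>> skips whitespace and reads one word per iteration
  while(ss >> s){
      cout << s << '\n'; // do stuff with your string (individual words) here
  }

  return 0;
}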


----------



## flushentitypacket

Sounds like strtok is what I need. Thanks!

Plex: Here is an example.

This is the lyrics of a song
aren't they great?

I need: this, is, the, lyrics, of, a, song, aren't, they, great

Actually, I just realized this is a problem. How do I make sure all of the character cases are the same? (i.e. This->this) I'm sure a quick google search will do. Don't worry about it: consider the case closed unless I post otherwise.

Thanks everyone!


----------



## Coma

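strcmp and variants (strcmp itself is case-sensitive; the case-insensitive variants are non-standard, e.g. strcasecmp on POSIX or _stricmp on Windows)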


----------



## Xazen

OP, are you using string objects (the C++ thing to do) or are your strings simply char arrays (aka cstrings)?

If they are string objects (probable) then you could use strtok as follows, assuming the input string is named 'input':

Code:


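// needs <cstring>, <string>, <vector>
string words[350]; //max of 350 words
int index = 0;

// strtok modifies its argument, so tokenize a writable, null-terminated copy of input
vector<char> buf(input.begin(), input.end());
buf.push_back('\0');

char *tok = strtok(&buf[0], " ,.-");
while(tok != NULL && index < 350)
{
   words[index] = tok; // copy the current token into the array
   index++;
   tok = strtok(NULL, " ,.-"); // NULL means "continue with the same buffer"
}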

It's very crude, but it should store each word in the string array (as long as there are no more than 350).
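If you'd rather stay with string objects the whole way, the same multi-delimiter split can be sketched without strtok, using string::find_first_of and string::find_first_not_of. This is only a rough sketch: the function name split_words and the default delimiter set are made up for illustration, and it returns a vector so there is no fixed 350-word limit.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): split `input` on any character in `delims`.
std::vector<std::string> split_words(const std::string& input,
                                     const std::string& delims = " ,.-") {
    std::vector<std::string> words;
    // Skip any leading delimiters to find the start of the first word.
    std::string::size_type start = input.find_first_not_of(delims);
    while (start != std::string::npos) {
        // `end` is the first delimiter after the word, or npos at end of input.
        std::string::size_type end = input.find_first_of(delims, start);
        words.push_back(input.substr(start, end - start));
        // Skip the run of delimiters to reach the next word (npos if none is left).
        start = input.find_first_not_of(delims, end);
    }
    return words;
}
```

Unlike strtok, this leaves the input string untouched and keeps no hidden state between calls.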


----------



## Plex

Quote:


> Originally Posted by *flushentitypacket;13222821*
> Sounds like strtok is what I need. Thanks!
> 
> Plex: Here is an example.
> 
> This is the lyrics of a song
> aren't they great?
> 
> I need: this, is, the, lyrics, of, a, song, aren't, they, great
> 
> Actually, I just realized this is a problem. How do I make sure all of the character cases are the same? (i.e. This->this) I'm sure a quick google search will do. Don't worry about it: consider the case closed unless I post otherwise.
> 
> Thanks everyone!


Splitting by whitespace is super easy, stringstreams can do all that legwork.

Code:


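// needs <sstream>, <string>, <vector>
string song("This is the lyrics of a song aren't they great?");
string songBuf;
stringstream ssSong(song);
vector<string> lyrics;

// each >> extraction reads the next whitespace-delimited word
while (ssSong >> songBuf)
    lyrics.push_back(songBuf);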

Each >> on the stringstream skips any leading whitespace and then reads characters up to the next whitespace, so songBuf holds exactly one word per loop iteration, and that word gets pushed onto the vector.

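As far as the case goes, C++ has built-in functions for that: toupper() and tolower() from the cctype header. They work on one character at a time, so you apply them to each character of the string. You can do this before or after the split above to make sure everything is lower case. I would recommend before.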
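Here's a rough sketch of the "lower-case first, then split" order, in case it helps. The helper names to_lower and lower_words are made up for this example, and it assumes a C++11 compiler for the lambda. The cast to unsigned char is there because passing a negative char value to tolower() is undefined behaviour.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: return a lower-cased copy of `text`.
std::string to_lower(std::string text) {
    // tolower() handles a single character, so apply it to every character.
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return text;
}

// Hypothetical helper: lower-case the text first, then split it on whitespace.
std::vector<std::string> lower_words(const std::string& text) {
    std::istringstream in(to_lower(text));
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}
```

Since this only splits on whitespace, punctuation stays attached: the last word of the example lyrics comes back as "great?" rather than "great", so stripping punctuation would be a separate step.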


----------



## flushentitypacket

I've decided to implement using cstrings.

I'm converting from a string to a cstring, but I don't understand why the sample code from the C++ Reference makes an array of str.size()+1 instead of just str.size(). Anyone know why?

// strings and c-strings
#include <iostream>
#include <cstring>
#include <string>
using namespace std;

int main ()
{
  char * cstr, *p;

  string str ("Please split this phrase into tokens");

  cstr = new char [str.size()+1];
  strcpy (cstr, str.c_str());

  // cstr now contains a c-string copy of str

  p=strtok (cstr," ");
  while (p!=NULL)
  {
    cout << p << endl;
    p=strtok(NULL," ");
  }

  delete[] cstr;
  return 0;
}


----------



## superhead91

Quote:


> Originally Posted by *flushentitypacket;13231172*
> I've decided to implement using cstrings.
> 
> I'm converting from a string to cstring, but I don't understand why the sample code from the C++ Reference makes an array of str.size()+1 instead of just str.size. Anyone know why?


I believe it's because a cstring is ended with the '\0' character, therefore it needs one more character than a c++ string.
Example: the C++ string "hello" has size() 5, but the cstring "hello" is stored as 'h', 'e', 'l', 'l', 'o', '\0', which is 6 chars. '\0' counts as one character just like '\n'. When you print out the cstring it will not print out the '\0'.


----------



## ghell

Quote:


> Originally Posted by *superhead91;13231272*
> I believe it's because a cstring is ended with the '\0' character, therefore it needs one more character than a c++ string.


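C++ strings also terminate with a nul character ('\0'): c_str() always returns a nul-terminated array, and since C++11 the string's own buffer is guaranteed to be nul-terminated as well.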

Neither string::size in C++ nor strlen in C counts the terminating character in its output.

That is, the length of the string "foobar" is returned as 6 in both, but also takes at least 7 bytes of memory to store in both.
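A quick way to see the "6 versus 7" point is to build the copy buffer yourself. This is just a sketch: to_cstring_buffer is a made-up helper name, it uses a vector<char> instead of new[]/delete[] so the memory cleans itself up, and it assumes C++11 for vector::data().

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: copy `str` into a writable, nul-terminated char buffer.
std::vector<char> to_cstring_buffer(const std::string& str) {
    // size() does not count the terminator, so reserve one extra char for '\0'.
    std::vector<char> buf(str.size() + 1);
    // strcpy copies the characters and the terminating '\0' from c_str().
    std::strcpy(buf.data(), str.c_str());
    return buf;
}
```

For "foobar", size() and strlen() both report 6, while the buffer returned here holds 7 chars and sizeof("foobar") is also 7, because the literal includes its terminator. The resulting buffer is writable, so it is safe to hand to strtok.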

When you define a string literal in the code rather than reading it from an external source such as a file or standard input, the compiler adds the \0 for you. That means you never need to (and probably shouldn't) actually type "foobar\0". It also means that if you have a char[6] and strcpy "foobar" into it, the terminating 0 byte is written one past the end of the array, which is a buffer overflow.

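This all assumes single-byte characters. A UTF-16 string terminates with two 0 bytes (one 16-bit zero code unit) and, when read from a file, may also start with a byte order mark. Wide literals are not too difficult to use, just L"foobar", though wchar_t is 16 bits on Windows and typically 32 bits elsewhere. If you're using something like a TCHAR you may not even know which character type is in use.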


----------

