Re: pyspark+spacy throwing pickling exception

2018-02-15 Thread Holden Karau
So you left out the exception. I’m also not sure how well spaCy
serializes, so to debug this I would start by moving the
nlp = spacy.load('en') line inside the function and see if it still fails.

On Thu, Feb 15, 2018 at 9:08 PM Selvam Raman  wrote:

> import spacy
>
> nlp = spacy.load('en')
>
> def getPhrases(content):
>     phrases = []
>     doc = nlp(str(content))
>     for chunks in doc.noun_chunks:
>         phrases.append(chunks.text)
>     return phrases
>
> The above function retrieves the noun phrases from the content and
> returns them as a list.
>
> def f(x):
>     print(x)
>
> description = xmlData.filter(col("dcterms:description").isNotNull()).select(col("dcterms:description").alias("desc"))
>
> description.rdd.flatMap(lambda row: getPhrases(row.desc)).foreach(f)
>
> When I call getPhrases I get the exception below.
>
>
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>
-- 
Twitter: https://twitter.com/holdenkarau


Pyspark UDF/map function throws pickling exception

2018-02-15 Thread Selvam Raman
import spacy

nlp = spacy.load('en')

def getPhrases(content):
    phrases = []
    doc = nlp(str(content))
    for chunks in doc.noun_chunks:
        phrases.append(chunks.text)
    return phrases

The above function retrieves the noun phrases from the content and
returns them as a list.

def f(x):
    print(x)

description = xmlData.filter(col("dcterms:description").isNotNull()).select(col("dcterms:description").alias("desc"))

description.rdd.flatMap(lambda row: getPhrases(row.desc)).foreach(f)

When I call getPhrases I get the exception below.

"""if islambda(obj) or obj.__code__.co_filename == '' or themodule
is None:
AttributeError: 'builtin_function_or_method' object has no attribute
'__code__' """
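
For readers unfamiliar with this message, a small illustration using only the standard library. Functions written in Python carry a __code__ attribute, while functions implemented in C do not, so code that reads obj.__code__ on a C function raises exactly this AttributeError. Which object inside the spaCy pipeline triggered it here is not established in this thread; len is only a stand-in.

```python
def plain(x):
    """An ordinary Python function; it has a code object."""
    return x


has_code_plain = hasattr(plain, "__code__")
has_code_builtin = hasattr(len, "__code__")  # len is implemented in C

print(type(len).__name__, has_code_plain, has_code_builtin)
# prints: builtin_function_or_method True False
```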

Full stack trace is below

Traceback (most recent call last):
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 148, in dump
return Pickler.dump(self, obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 409, in dump
self.save(obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 751, in save_tuple
save(element)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 255, in save_function
self.save_function_tuple(obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 292, in save_function_tuple
save((code, closure, base_globals))
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 736, in save_tuple
save(element)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 781, in save_list
self._batch_appends(obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 805, in _batch_appends
save(x)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 255, in save_function
self.save_function_tuple(obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 292, in save_function_tuple
save((code, closure, base_globals))
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 736, in save_tuple
save(element)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 781, in save_list
self._batch_appends(obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 805, in _batch_appends
save(x)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 255, in save_function
self.save_function_tuple(obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 292, in save_function_tuple
save((code, closure, base_globals))
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 736, in save_tuple
save(element)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 781, in save_list
self._batch_appends(obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",

pyspark+spacy throwing pickling exception

2018-02-15 Thread Selvam Raman
import spacy

nlp = spacy.load('en')

def getPhrases(content):
    phrases = []
    doc = nlp(str(content))
    for chunks in doc.noun_chunks:
        phrases.append(chunks.text)
    return phrases

The above function retrieves the noun phrases from the content and
returns them as a list.

def f(x):
    print(x)

description = xmlData.filter(col("dcterms:description").isNotNull()).select(col("dcterms:description").alias("desc"))

description.rdd.flatMap(lambda row: getPhrases(row.desc)).foreach(f)

When I call getPhrases I get the exception below.



-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: Pyspark UDF/map function throws pickling exception

2018-02-15 Thread Selvam Raman
pyspark - 2.2.1
spacy - 2.0.7
python - 3.6


Placing full logs here

Traceback (most recent call last):
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 148, in dump
return Pickler.dump(self, obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 409, in dump
self.save(obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 751, in save_tuple
save(element)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 255, in save_function
self.save_function_tuple(obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 292, in save_function_tuple
save((code, closure, base_globals))
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 736, in save_tuple
save(element)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 781, in save_list
self._batch_appends(obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 805, in _batch_appends
save(x)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 255, in save_function
self.save_function_tuple(obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 292, in save_function_tuple
save((code, closure, base_globals))
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 736, in save_tuple
save(element)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 781, in save_list
self._batch_appends(obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 805, in _batch_appends
save(x)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 255, in save_function
self.save_function_tuple(obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 292, in save_function_tuple
save((code, closure, base_globals))
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 736, in save_tuple
save(element)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 781, in save_list
self._batch_appends(obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 805, in _batch_appends
save(x)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 255, in save_function
self.save_function_tuple(obj)
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyspark/cloudpickle.py",
line 292, in save_function_tuple
save((code, closure, base_globals))
  File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py",
line 476, in save
f(self, obj) # Call unbound method 

Re: pyspark+spacy throwing pickling exception

2018-02-15 Thread Selvam Raman
Hi,

I solved the issue by extracting the method into a separate module.

Failure:
extract.py contained the whole implementation. Because everything lived
in that single module, the driver tried to serialize the spaCy (English)
object and send it to the executors, and that is where I hit the pickling
exception.

Success:
extract.py now calls the getPhrases method defined in spacyutils.py.

Now spaCy is initialized on the executor, so there is no need to
serialize it.

Please let me know whether my understanding is correct.
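
One way to read this result, offered as a sketch rather than a confirmed diagnosis: pickle stores a function that lives in an importable module by reference, that is, as a module name plus a function name, and cloudpickle is generally understood to do the same while serializing functions defined in the main script by value together with the globals they use. Under that reading, moving getPhrases into spacyutils.py means only the name travels to the executor, which then imports the module and loads spaCy locally. The snippet below shows by-reference pickling with the standard library only; json.dumps is a stand-in for spacyutils.getPhrases.

```python
import json
import pickle

# A function from an importable module is pickled as a reference
# (module name and attribute name), not as code plus captured globals.
payload = pickle.dumps(json.dumps)

# Unpickling imports the module and looks the function up again,
# so the very same function object comes back.
restored = pickle.loads(payload)

print(b"json" in payload, restored is json.dumps)
```

For the by-reference path to work on a cluster, the module has to be importable on the executors as well, for example by shipping spacyutils.py with the --py-files option of spark-submit.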


On Thu, Feb 15, 2018 at 12:14 PM, Holden Karau  wrote:

> So you left out the exception. On one hand I’m also not sure how well
> spacy serializes, so to debug this I would start off by moving the nlp =
> inside of my function and see if it still fails.
>
> On Thu, Feb 15, 2018 at 9:08 PM Selvam Raman  wrote:
>
>> import spacy
>>
>> nlp = spacy.load('en')
>>
>> def getPhrases(content):
>>     phrases = []
>>     doc = nlp(str(content))
>>     for chunks in doc.noun_chunks:
>>         phrases.append(chunks.text)
>>     return phrases
>>
>> The above function retrieves the noun phrases from the content and
>> returns them as a list.
>>
>> def f(x):
>>     print(x)
>>
>> description = xmlData.filter(col("dcterms:description").isNotNull()).select(col("dcterms:description").alias("desc"))
>>
>> description.rdd.flatMap(lambda row: getPhrases(row.desc)).foreach(f)
>>
>> When I call getPhrases I get the exception below.
>>
>>
>>
>> --
>> Selvam Raman
>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"