Looking for the method executors use to write to HDFS

2015-11-04 Thread Tóth Zoltán
Hi,

I'd like to write a Parquet file from the driver. I could use the HDFS API
but I'm worried that it won't work on a secure cluster. I assume that the
method the executors use to write to HDFS takes care of managing Hadoop
security. However, I can't find the place where the HDFS write happens in
the Spark source.

Please help me:
1. How to write Parquet from the driver using the Spark API? (A sketch of
what I mean is below.)
2. If this isn't possible, where can I find the method the executors use to
write to HDFS?
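
For concreteness, a minimal sketch of what I mean by question 1, assuming a
SQLContext is already set up (the data and the path are made up):

from pyspark.sql import Row

# A small amount of data that exists only on the driver.
rows = [Row(id=1, value="a"), Row(id=2, value="b")]
df = sqlContext.createDataFrame(rows)

# df.write goes through Spark's normal Hadoop output path, so on a
# kerberized cluster I'd expect it to pick up the delegation tokens
# obtained at submit time.
df.write.parquet("/tmp/driver_written.parquet")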

Thanks,
Zoltan


Using ML KMeans without hardcoded feature vector creation

2015-09-15 Thread Tóth Zoltán
Hi,

I'm wondering if there is a concise way to run ML KMeans on a DataFrame if
I have the features in multiple numeric columns.

E.g., as in the Iris dataset:
(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa',
binomial_label=1)

I'd like to use KMeans without recreating the DataFrame with the feature
vector manually added as a new column and the original columns hardcoded
repeatedly in the code.

The solution I'd like to improve:

from pyspark.mllib.linalg import Vectors
from pyspark.sql import Row
from pyspark.ml.clustering import KMeans, KMeansModel

iris = sqlContext.read.parquet("/opt/data/iris.parquet")
iris.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa',
#     binomial_label=1)

# Rebuild every Row by hand just to add the assembled feature vector.
df = iris.map(lambda r: Row(
    id=r.id,
    a1=r.a1,
    a2=r.a2,
    a3=r.a3,
    a4=r.a4,
    label=r.label,
    binomial_label=r.binomial_label,
    features=Vectors.dense(r.a1, r.a2, r.a3, r.a4))
).toDF()


kmeans_estimator = KMeans()\
    .setFeaturesCol("features")\
    .setPredictionCol("prediction")
kmeans_transformer = kmeans_estimator.fit(df)

predicted_df = kmeans_transformer.transform(df).drop("features")
predicted_df.first()

# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, binomial_label=1, id=u'id_1',
#     label=u'Iris-setosa', prediction=1)



I'm looking for a solution along the lines of:

feature_cols = ["a1", "a2", "a3", "a4"]
prediction_col_name = "prediction"
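
One direction that looks promising is VectorAssembler from
pyspark.ml.feature (available since Spark 1.4). A rough, untested sketch of
how it could replace the hand-built Row above:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Assemble the numeric columns into a single vector column without
# rebuilding every Row by hand.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
kmeans = KMeans(featuresCol="features", predictionCol=prediction_col_name)

pipeline = Pipeline(stages=[assembler, kmeans])
model = pipeline.fit(iris)
predicted_df = model.transform(iris).drop("features")
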
Thanks,

Zoltan


Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Tóth Zoltán
Unfortunately I'm getting the same error.
The other interesting things are:
 - the Parquet files actually got written to HDFS (also with
.write.parquet())
 - the application stays stuck in the RUNNING state for good, even after
the error is thrown

15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 19
15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 5
15/09/07 10:01:12 INFO spark.ContextCleaner: Cleaned accumulator 20
Exception in thread "Thread-7"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "Thread-7"
Exception in thread "org.apache.hadoop.hdfs.PeerCache@4070d501"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread
"org.apache.hadoop.hdfs.PeerCache@4070d501"
Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread
"LeaseRenewer:r...@docker.rapidminer.com:8020"
Exception in thread "Reporter"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "Reporter"
Exception in thread "qtp2134582502-46"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "qtp2134582502-46"




On Mon, Sep 7, 2015 at 3:48 PM, boci  wrote:

> Hi,
>
> Can you try using the save method instead of write?
>
> ex: out_df.save("path","parquet")
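>
> (For reference: df.save is the older 1.3-style API; in 1.4+ it is
> deprecated and, as far as I can tell, delegates to the same writer, i.e.
> roughly:)
>
> out_df.write.format("parquet").save("path")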
>
> b0c1
>
>
> --
> Skype: boci13, Hangout: boci.b...@gmail.com
>
> On Mon, Sep 7, 2015 at 3:35 PM, Zoltán Tóth  wrote:
>
>> Aaand, the error! :)
>>
>> Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread 
>> "org.apache.hadoop.hdfs.PeerCache@4e000abf"
>> Exception in thread "Thread-7"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "Thread-7"
>> Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread 
>> "LeaseRenewer:r...@docker.rapidminer.com:8020"
>> Exception in thread "Reporter"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "Reporter"
>> Exception in thread "qtp2115718813-47"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "qtp2115718813-47"
>>
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"
>>
>> Log Type: stdout
>>
>> Log Upload Time: Mon Sep 07 09:03:01 -0400 2015
>>
>> Log Length: 986
>>
>> Traceback (most recent call last):
>>   File "spark-ml.py", line 33, in <module>
>> out_df.write.parquet("/tmp/logparquet")
>>   File 
>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/readwriter.py",
>>  line 422, in parquet
>>   File 
>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>>  line 538, in __call__
>>   File 
>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/utils.py",
>>  line 36, in deco
>>   File 
>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>>  line 300, in get_return_value
>> py4j.protocol.Py4JJavaError
>>
>>
>>
>> On Mon, Sep 7, 2015 at 3:27 PM, Zoltán Tóth 
>> wrote:
>>
>>> Hi,
>>>
>>> When I execute the Spark ML Logistic Regression example in pyspark I run
>>> into an OutOfMemory exception. I'm wondering if any of you have
>>> experienced the same, or have a hint about how to fix this.
>>>
>>> The interesting bit is that I only get the exception when I try to write
>>> the result DataFrame into a file. If I only "print" any of the results, it
>>> all works fine.
>>>
>>> My Setup:
>>> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest
>>> nightly build)
>>> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
>>> -DzincPort=3034
>>>
>>> I'm using the default resource setup
>>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor
>>> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
>>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
>>> capability: <memory:1408, vCores:1>)
>>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
>>> capability: <memory:1408, vCores:1>)
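>>>
>>> If it's simply the driver JVM running out of memory during the write, one
>>> thing I can still try is raising its memory at submit time, something like
>>> (the exact flag depends on client vs. cluster mode):
>>>
>>> spark-submit --driver-memory 2g spark-ml.py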
>>>
>>>