I realized that in Spark ML, BinaryClassificationMetrics only supports
AreaUnderPR and AreaUnderROC. Why is that?
What if I need other metrics such as F-score or accuracy? I tried to use
MulticlassClassificationEvaluator to evaluate other metrics such as
accuracy for a binary classification
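As a possible workaround (a sketch, not a definitive answer): MulticlassClassificationEvaluator only needs a prediction/label DataFrame, regardless of the number of classes, so it can report F1 (and, in newer versions, accuracy) for a binary problem. The DataFrame and column names below are assumptions:

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// `predictions` is a hypothetical DataFrame with "label" and "prediction"
// columns produced by a binary classifier.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("f1")   // "accuracy" is also accepted in later Spark versions

val f1 = evaluator.evaluate(predictions)
```

This requires a Spark session/cluster to run; it is only meant to show the evaluator's shape, not a verified recipe for every Spark version.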
Hi, there
I am trying to generate a unique ID for each record in a dataframe, so that I
can save the dataframe to a relational database table. My question is:
when the dataframe is regenerated due to executor failure or being evicted
from the cache, do the IDs stay the same as before?
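For what it's worth, ID generators such as zipWithUniqueId or monotonically increasing IDs are generally not guaranteed to produce the same values if partitions are recomputed. A sketch of one alternative that is stable across recomputation, deriving the ID from the record contents (column names are hypothetical):

```scala
import org.apache.spark.sql.functions.{concat_ws, sha2}

// `df` is a hypothetical DataFrame. Hashing a natural key yields an ID that
// is deterministic no matter how many times the lineage is re-executed.
val withId = df.withColumn(
  "id",
  sha2(concat_ws("|", df("key1"), df("key2")), 256)
)
```

This assumes the records have a natural key; if they do not, a positional ID will only be stable when the input ordering and partitioning are stable.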
Hi, there
From the Spark UI, we can monitor the following two metrics:
• Processing Time - the time to process each batch of data.
• Scheduling Delay - the time a batch waits in a queue for the
processing of previous batches to finish.
However, what is the best way to monitor
Hello,
My understanding is that the YARN executor container memory is based on
"spark.executor.memory" + "spark.yarn.executor.memoryOverhead". The first one
is for heap memory and the second one is for off-heap memory.
spark.executor.memory is passed to -Xmx to set the max heap size. Now my question
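To make the arithmetic above concrete, a sketch of the two settings (the 4g and 512 MB values are assumed examples, and the overhead default varies by Spark version):

```
# spark-defaults.conf (hypothetical values)
spark.executor.memory                 4g    # JVM heap, passed as -Xmx4g
spark.yarn.executor.memoryOverhead    512   # MB, off-heap
# YARN container request ≈ 4096 MB + 512 MB = 4608 MB,
# rounded up to the YARN minimum allocation increment
```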
but I am worried if it is a single large file.
>>
>> In this case, this would only work in a single executor, which I think will
>> end up with an OutOfMemoryException.
>>
>> Spark JSON data source does not support multi-line JSON as input due to
>> the limitation of TextInpu
Hi, there
Spark has provided JSON document processing for a long time. In
most examples I see, each line is a JSON object in the sample file. That is
the easiest case. But how can we process a JSON document that does not
conform to this standard format (one JSON object per line)? Here
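One common workaround is to read each file whole with wholeTextFiles and hand the result to the JSON reader. A sketch (the path is a placeholder, and `sc` is the SparkContext):

```scala
import org.apache.spark.sql.SQLContext

// Each element of wholeTextFiles is (path, entireFileContents), so a
// multi-line JSON document stays intact as one record.
val docs = sc.wholeTextFiles("hdfs:///path/to/json-docs").map(_._2)

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json(docs)   // expects one JSON object per element
```

The caveat echoed in the quoted reply applies: each whole file must fit in an executor's memory, so this only suits many small-to-medium documents, not a single huge one.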
p where as online training can be hard and needs to be
> constantly monitored to see how it is performing.
>
> Hope this helps in understanding offline learning vs. online learning and
> which algorithms you can choose for online learning in MLlib.
>
> Guru Medasani
> gdm...@
Sorry, accidentally sent again. My apologies.
> On Mar 6, 2016, at 1:22 PM, Lan Jiang <ljia...@gmail.com> wrote:
>
> Hi, there
>
> I hope someone can clarify this for me. It seems that some of the MLlib
> algorithms such as KMean, Linear Regression and Logistics Regres
Hi, there
I hope someone can clarify this for me. It seems that some of the MLlib
algorithms such as KMeans, Linear Regression and Logistic Regression have a
streaming version, which can do online machine learning. But does that mean
other MLlib algorithms cannot be used in Spark Streaming
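For reference, a streaming variant looks like the sketch below (the DStream name, dimensionality and k are assumptions). Models without a Streaming* variant can still be applied to streaming data inside transform/foreachRDD; it is incremental training that requires the streaming versions:

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans

// `trainingStream` is a hypothetical DStream[Vector] of 2-dimensional points.
val model = new StreamingKMeans()
  .setK(3)
  .setDecayFactor(1.0)            // 1.0 = weight all batches equally
  .setRandomCenters(2, 0.0, 42L)  // dim, initial weight, seed

model.trainOn(trainingStream)     // updates cluster centers on each batch
// model.predictOn(testStream) would assign cluster IDs as data arrives
```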
Hi, there
I am looking at the Spark SQL setting spark.sql.autoBroadcastJoinThreshold.
According to the programming guide:
*Note that currently statistics are only supported for Hive Metastore
tables where the command ANALYZE TABLE COMPUTE STATISTICS
noscan has been run.*
My question is: is
Michael,
Thanks for the reply.
On Wed, Feb 10, 2016 at 11:44 AM, Michael Armbrust
wrote:
> My question is that is "NOSCAN" option a must? If I execute "ANALYZE TABLE
>> compute statistics" command in Hive shell, is the statistics
>> going to be used by SparkSQL to
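For reference, the two variants discussed in this thread look like this in the Hive shell (the table name is a placeholder):

```sql
-- Full scan: computes statistics including row counts.
ANALYZE TABLE my_table COMPUTE STATISTICS;

-- NOSCAN: gathers only file-level statistics (number of files, total size)
-- without reading the data; this is the variant the Spark guide mentions.
ANALYZE TABLE my_table COMPUTE STATISTICS NOSCAN;
```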
Need some clarification about the documentation. According to Spark doc
"the default interval is a multiple of the batch interval that is at least 10
seconds. It can be set by using dstream.checkpoint(checkpointInterval).
Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream
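The setting mentioned in the quoted doc looks like this in code (a sketch; the 10-second batch interval and 5x multiple are illustrative, following the doc's rule of thumb):

```scala
import org.apache.spark.streaming.Seconds

// `ssc` is a hypothetical StreamingContext with a 10-second batch interval.
val lines = ssc.socketTextStream("localhost", 9999)

// Checkpoint this DStream at 5x the batch interval.
lines.checkpoint(Seconds(50))
```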
For a Spark data science project, Python might be a good choice. However, for
Spark Streaming, the Python API is still lagging. For example, for the Kafka
no-receiver (direct) connector, according to the Spark 1.5.2 doc: "Spark 1.4 added a
Python API, but it is not yet at full feature parity".
Java does not
I have not run into any linkage problem, but maybe I was lucky. :-). The
reason I wanted to use protobuf 3 is mainly for Map type support.
On Thu, Nov 5, 2015 at 4:43 AM, Steve Loughran <ste...@hortonworks.com>
wrote:
>
> > On 5 Nov 2015, at 00:12, Lan Jiang <ljia...@gmail.com
I have used protobuf 3 successfully with Spark on CDH 5.4, even though Hadoop
itself comes with protobuf 2.5. I think the steps apply to HDP too. You need to
do the following
1. Set the below parameters:
spark.executor.userClassPathFirst=true
spark.driver.userClassPathFirst=true
2. Include
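Concretely, the two settings from step 1 can be passed at submit time (a sketch; the class and jar names are placeholders):

```shell
# Make the application's protobuf 3 classes win over Hadoop's protobuf 2.5.
# com.example.MyJob and my-uber.jar are hypothetical names.
spark-submit \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.driver.userClassPathFirst=true \
  --class com.example.MyJob \
  my-uber.jar
```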
As Francois pointed out, you are encountering a classic small-file
anti-pattern. One solution I used in the past is to wrap all these small binary
files into a sequence file or Avro file. For example, the Avro schema can have
two fields: filename: string and binaryname: byte[]. Thus your file is
s/1.4.0/api/java/org/apache/spark/SparkContext.html#wholeTextFiles(java.lang.String,%20int)>
>
> On 20 October 2015 at 15:04, Lan Jiang <ljia...@gmail.com
> <mailto:ljia...@gmail.com>> wrote:
> As Francois pointed out, you are encountering a classic small file
> anti-patt
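The two-field schema described above might look like this as an Avro schema file (record and field names follow the description; adjust as needed):

```json
{
  "type": "record",
  "name": "SmallFile",
  "fields": [
    {"name": "filename",   "type": "string"},
    {"name": "binaryname", "type": "bytes"}
  ]
}
```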
On Mon, Oct 12, 2015 at 8:34 AM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:
> Can you look a bit deeper in the executor logs? It could be filling up the
> memory and getting killed.
>
> Thanks
> Best Regards
>
> On Mon, Oct 5, 2015 at 8:55 PM, Lan Jiang <ljia...@gmail.com> wrote:
>
> Hi Lan, thanks for the response. Yes, I know, and I have confirmed in the UI that it
> has only 12 partitions because of 12 HDFS blocks, and the Hive ORC file stripe size
> is 33554432.
>
> On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <ljia...@gmail.com
>
Hi, there
I have a Spark batch job running on CDH 5.4 + Spark 1.3.0. The job is submitted in
"yarn-client" mode. The job failed because YARN killed several executor
containers that exceeded the memory limit imposed by YARN.
However, when I went to the YARN resource manager
The partition number should be the same as the HDFS block number instead of
the file number. Did you confirm from the Spark UI that only 12 partitions were
created? What is your ORC orc.stripe.size?
Lan
> On Oct 8, 2015, at 1:13 PM, unk1102 wrote:
>
> Hi I have the
Hi, there
My understanding is that the cache storage is calculated as follows:
executor heap size * spark.storage.safetyFraction * spark.storage.memoryFraction
The default value for safetyFraction is 0.9 and memoryFraction is 0.6. When
I started a Spark job on YARN, I set executor-memory to
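To make the formula concrete (the 4 GB executor heap below is an assumed example, using the default fractions):

```scala
// Hypothetical: --executor-memory 4g with the default fractions
val heapGB = 4.0
val cacheGB = heapGB * 0.9 * 0.6   // safetyFraction * memoryFraction
// cacheGB = 2.16, i.e. roughly 2.2 GB usable for cached RDDs
```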
, write the result to HDFS. I use Spark 1.3 with
spark-avro (1.0.0). The error only happens when running on the whole
dataset. When running on 1/3 of the files, the same job completes without
error.
On Thu, Oct 1, 2015 at 2:41 PM, Lan Jiang <ljia...@gmail.com> wrote:
> Hi, there
Hi, there
When running a Spark job on YARN, 2 executors somehow got lost during
execution. The message on the history server GUI is "CANNOT find address". Two
extra executors were launched by YARN and eventually finished the job. Usually
I go to the "Executors" tab on the UI to check the
15 at 1:30 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Can you go to YARN RM UI to find all the attempts for this Spark Job ?
>
> The two lost executors should be found there.
>
> On Thu, Oct 1, 2015 at 10:30 AM, Lan Jiang <ljia...@gmail.com> wrote:
>
>> Hi, t
Hi, there
Here is the problem I ran into when executing a Spark job (Spark 1.3). The
Spark job loads a bunch of Avro files using the Spark SQL spark-avro 1.0.0
library. Then it does some filter/map transformations, repartitions to 1
partition and then writes to HDFS. It creates 2 stages. The total
Hi, there
I ran into an issue when using Spark (v 1.3) to load an Avro file through Spark
SQL. The code sample is below:
val df = sqlContext.load("path-to-avro-file", "com.databricks.spark.avro")
val myrdd = df.select("Key", "Name", "binaryfield").rdd
val results = myrdd.map(...)
val finalResults =
using spark-submit.
>
> If that doesn't work, I'd recommend shading, as others already have.
>
> On Tue, Sep 15, 2015 at 9:19 AM, Lan Jiang <ljia...@gmail.com> wrote:
>> I used the --conf spark.files.userClassPathFirst=true in the spark-shell
>> option, it still ga
996>
>
> Yong
>
> Subject: Re: Change protobuf version or any other third party library version
> in Spark application
> From: ste...@hortonworks.com <mailto:ste...@hortonworks.com>
> To: ljia...@gmail.com <mailto:ljia...@gmail.com>
> CC: user@spark.apache.org <mailto:u
ste...@hortonworks.com
> To: ljia...@gmail.com
> CC: user@spark.apache.org
> Date: Tue, 15 Sep 2015 09:19:28 +
>
>
>
>
> On 15 Sep 2015, at 05:47, Lan Jiang <ljia...@gmail.com> wrote:
>
> Hi, there,
>
> I am using Spark 1.4.1. The protobuf 2.5 is included b
Hi, there
I ran into a problem when I tried to pass an external jar file to spark-shell.
I have an uber jar file that contains all the Java code I created for protobuf
and all its dependencies.
If I simply execute my code using the Scala shell, it works fine without error. I
use -cp to pass the
Hi, there,
I am using Spark 1.4.1. Protobuf 2.5 is included by Spark 1.4.1 by
default. However, I would like to use Protobuf 3 in my Spark application so
that I can use some new features such as Map support. Is there any way to
do that?
Right now if I build an uber jar with dependencies
The YARN capacity scheduler supports hierarchical queues, to which you can
assign cluster resources as percentages. Your Spark application/shell can be
submitted to different queues. Mesos supports a fine-grained mode, which
allows the cores used by each executor to ramp up and down.
Lan
On Wed, Apr 22,
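As an illustration of such a queue hierarchy (queue names and percentages here are made up), capacity-scheduler.xml might contain:

```xml
<!-- Two child queues under root, splitting cluster capacity 70/30 -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,adhoc</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
  <value>30</value>
</property>
```

A Spark application or shell could then be pointed at a queue with, e.g., --queue adhoc on spark-submit.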
Rename your log4j_special.properties file to log4j.properties and place it
under the root of your jar file; you should be fine.
If you are using Maven to build your jar, place the log4j.properties file in the
src/main/resources folder.
However, please note that if you have other dependency jar
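The Maven layout described above, so that log4j.properties lands at the jar root when packaged:

```
my-app/
├── pom.xml
└── src/
    └── main/
        ├── java/...
        └── resources/
            └── log4j.properties   <-- packaged at the root of the jar
```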
that going this way I will not be able to run several applications
on the same cluster in parallel?
Regarding submit, I am not using it now, I submit from the code, but I think
I should consider this option.
Thanks.
On Mon, Apr 20, 2015 at 5:59 PM, Lan Jiang ljia...@gmail.com